Two New Probability Inequalities and Concentration Results
Concentration results and probabilistic analysis for combinatorial problems
such as the TSP, MWST, and graph coloring have received much attention, but
generally only for i.i.d. samples (e.g., i.i.d. points in the unit square for
the TSP). Here we prove two probability inequalities that generalize and
strengthen martingale inequalities. The inequalities provide the tools to
handle more general heavy-tailed and inhomogeneous distributions for
combinatorial problems. We prove a wide range of applications: in addition to
the TSP, MWST, and graph coloring, we also prove more general results than
previously known for concentration in bin packing, subgraph counts, and the
Johnson-Lindenstrauss random projection theorem. It is hoped that the strength
of the inequalities will serve many more purposes.
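Concentration of this kind is easy to see empirically. The sketch below is purely illustrative (it does not use the paper's inequalities): it runs the first-fit bin-packing heuristic on i.i.d. uniform item sizes and observes that the number of bins used fluctuates very little around its mean across independent trials.

```python
import random

def first_fit(items):
    """Pack items into unit-capacity bins with the first-fit heuristic."""
    bins = []
    for x in items:
        for i, load in enumerate(bins):
            if load + x <= 1.0:
                bins[i] += x
                break
        else:
            bins.append(x)  # no existing bin fits: open a new one
    return len(bins)

rng = random.Random(0)
counts = [first_fit([rng.random() for _ in range(500)]) for _ in range(30)]
mean = sum(counts) / len(counts)
std = (sum((c - mean) ** 2 for c in counts) / len(counts)) ** 0.5
print(mean, std)  # the spread is tiny relative to the mean
```

Changing any single item changes the first-fit bin count by only a bounded amount, which is exactly the bounded-difference structure that martingale-type concentration inequalities exploit.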
Clustering with Spectral Norm and the k-means Algorithm
There has been much progress on efficient algorithms for clustering data
points generated by a mixture of probability distributions under the
assumption that the means of the distributions are well-separated, i.e., the
distance between the means of any two distributions is at least a sufficient
number of standard deviations. These results generally make heavy use of the
generative model and particular properties of the distributions. In this
paper, we show that a simple clustering algorithm works without assuming any
generative (probabilistic) model. Our only assumption is what we call a
"proximity condition": the projection of any data point onto the line joining
its cluster center to any other cluster center is a sufficient number of
standard deviations closer to its own center than to the other center. Here
the notion of standard deviation is based on the spectral norm of the matrix
whose rows represent the difference between each point and the mean of the
cluster to which it belongs. We show that the proximity condition is satisfied
in the generative models studied, and so we are able to derive most known
results for generative models as corollaries of our main result. We also prove
some new results for generative models; e.g., we can cluster all but a small
fraction of points assuming only a bound on the variance. Our algorithm relies
on the well-known k-means algorithm, and along the way we prove a result of
independent interest: that the k-means algorithm converges to the "true
centers" even in the presence of spurious points, provided the initial
(estimated) centers are close enough to the corresponding actual centers and
all but a small fraction of the points satisfy the proximity condition.
Finally, we present a new technique for boosting the ratio of inter-center
separation to standard deviation.
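As a toy illustration of the convergence claim (all parameters here are hypothetical, and this is plain Lloyd k-means, not the paper's analysis): starting from rough initial centers on two well-separated clusters, with a few spurious points mixed in, the iterations still land close to the true means.

```python
import random

def lloyd(points, centers, iters=20):
    """Plain k-means (Lloyd) iterations in 2-D: assign each point to its
    nearest center, then recompute each center as its cluster's mean."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# two well-separated clusters around (0, 0) and (10, 0), plus spurious points
rng = random.Random(1)
pts = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
       + [(rng.gauss(10, 1), rng.gauss(0, 1)) for _ in range(200)]
       + [(5.0, 5.0)] * 5)
# initial centers are only roughly right, but close enough to the true means
est = lloyd(pts, [(1.0, 1.0), (9.0, -1.0)])
print(sorted(est))  # close to (0, 0) and (10, 0)
```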
Spectral Approaches to Nearest Neighbor Search
We study spectral algorithms for the high-dimensional Nearest Neighbor Search
problem (NNS). In particular, we consider a semi-random setting where a
dataset is chosen arbitrarily from an unknown low-dimensional subspace and
then perturbed by fully high-dimensional Gaussian noise. We design spectral
NNS algorithms whose query time depends polynomially on the dimension and the
logarithm of the dataset size, for large ranges of the parameters. Our
algorithms use a repeated computation of the top PCA vector/subspace, and are
effective even when the random-noise magnitude is much larger than the
interpoint distances in the dataset. Our motivation is that in practice, a
number of spectral NNS algorithms outperform the random-projection methods
that seem otherwise theoretically optimal on worst-case datasets. In this
paper we aim to provide theoretical justification for this disparity.
Comment: Accepted in the proceedings of FOCS 2014. 30 pages and 4 figures
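The core spectral idea, reducing to the top PCA direction before searching, can be sketched in a few lines. This toy version (one hidden dimension, mild noise; all parameter choices are illustrative, not from the paper) estimates the top principal direction by power iteration and then answers a nearest-neighbor query in the one-dimensional projection.

```python
import random

def top_direction(X, iters=100):
    """Power iteration for the top eigenvector of X^T X
    (i.e., the top right singular vector of X)."""
    d = len(X[0])
    v = [1.0] * d
    for _ in range(iters):
        u = [sum(row[j] * v[j] for j in range(d)) for row in X]        # X v
        w = [sum(X[i][j] * u[i] for i in range(len(X))) for j in range(d)]  # X^T u
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

rng = random.Random(2)
d = 5
axis = [0.6, 0.0, 0.8, 0.0, 0.0]   # hidden 1-dim subspace (unit vector)
pts = [[i * a + rng.gauss(0, 0.05) for a in axis] for i in range(20)]
q = [7.3 * a for a in axis]        # query closest to point 7

# center the data, estimate the top PCA direction, search in the projection
mean = [sum(p[j] for p in pts) / len(pts) for j in range(d)]
X = [[p[j] - mean[j] for j in range(d)] for p in pts]
v = top_direction(X)
proj = [sum(p[j] * v[j] for j in range(d)) for p in pts]
qp = sum(q[j] * v[j] for j in range(d))
nn = min(range(len(pts)), key=lambda i: abs(proj[i] - qp))
print(nn)  # index of the nearest neighbor found in the 1-d projection
```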
Random Separating Hyperplane Theorem and Learning Polytopes
The Separating Hyperplane Theorem is a fundamental result in convex geometry
with myriad applications. Our first result, the Random Separating Hyperplane
Theorem (RSH), is a strengthening of this theorem for polytopes. RSH asserts
that if the distance between a point a and a polytope K with k vertices and
unit diameter in R^d is at least delta, where delta is a fixed constant in
(0,1), then a randomly chosen hyperplane separates a and K with probability at
least 1/poly(k) and margin at least Omega(delta/sqrt(d)).
An immediate consequence of our result is the first near-optimal bound on the
error increase in the reduction from a Separation oracle to an Optimization
oracle over a polytope.
RSH has algorithmic applications in learning polytopes. We consider a
fundamental problem, denoted the "Hausdorff problem", of learning a
unit-diameter polytope K within Hausdorff distance delta, given an
optimization oracle for K. Using RSH, we show that with polynomially many
random queries to the optimization oracle, K can be approximated within error
O(delta). To our knowledge this is the first provable algorithm for the
Hausdorff problem. Building on this result, we show that if the vertices of K
are well-separated, then an optimization oracle can be used to generate a list
of points, each within Hausdorff distance O(delta) of K, with the property
that the list contains a point close to each vertex of K. Further, we show how
to prune this list to generate a (unique) approximation to each vertex of the
polytope. We prove that in many latent variable settings, e.g., topic
modeling, LDA, optimization oracles do exist provided we project to a suitable
SVD subspace. Thus, our work yields the first efficient algorithm for finding
approximations to the vertices of the latent polytope under the
well-separatedness assumption.
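A quick Monte Carlo sanity check of the random-separation phenomenon (a toy instance with made-up coordinates, not the paper's quantitative statement): for a point at constant distance from a small polytope, a constant fraction of uniformly random directions already define a separating hyperplane.

```python
import math, random

# Polytope K: unit square centered at the origin; point a at distance 1.5 from K.
verts = [(0.5, 0.5), (0.5, -0.5), (-0.5, 0.5), (-0.5, -0.5)]
a = (2.0, 0.0)

rng = random.Random(3)
hits = 0
trials = 4000
for _ in range(trials):
    # uniformly random unit direction (Gaussian vector, normalized)
    g = (rng.gauss(0, 1), rng.gauss(0, 1))
    n = math.hypot(g[0], g[1])
    u = (g[0] / n, g[1] / n)
    # the hyperplane normal to u separates a from K iff u.a > max over vertices of u.v
    if u[0] * a[0] + u[1] * a[1] > max(u[0] * v[0] + u[1] * v[1] for v in verts):
        hits += 1
print(hits / trials)  # a constant fraction of directions separate a from K
```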
Principal Component Analysis and Higher Correlations for Distributed Data
We consider algorithmic problems in the setting in which the input data has
been partitioned arbitrarily across many servers. The goal is to compute a
function of all the data, and the bottleneck is the communication used by the
algorithm. We present algorithms for two illustrative problems on massive data
sets: (1) computing a low-rank approximation of a matrix A = A_1 + ... + A_s,
where matrix A_i is stored on server i, and (2) computing a function of a
vector a_1 + ... + a_s, where server i has the vector a_i; this includes the
well-studied special case of computing frequency moments and separable
functions, as well as higher-order correlations such as the number of
subgraphs of a specified type occurring in a graph. For both problems we give
algorithms with nearly optimal communication; in particular, the only
dependence on n, the size of the data, is in the number of bits needed to
represent indices and words.
Comment: rewritten with focus on two main results (distributed PCA,
higher-order moments and correlations) in the arbitrary partition model
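For intuition, here is one natural baseline for problem (1) in the arbitrary-partition model: distributed power iteration, where each round costs only a broadcast plus one vector reply per server. This is an illustrative sketch under a rank-1 toy input, not the paper's (more communication-efficient) protocol.

```python
def matvec(M, x):
    """M x for a matrix stored as a list of rows."""
    return [sum(r[j] * x[j] for j in range(len(x))) for r in M]

def rmatvec(M, y):
    """M^T y for a matrix stored as a list of rows."""
    d = len(M[0])
    return [sum(M[i][j] * y[i] for i in range(len(M))) for j in range(d)]

# rank-1 ground truth A = u w^T, split additively across two servers
u = [1.0, 2.0, 0.5, -1.0]
w = [3.0, 0.0, 4.0]
A = [[ui * wj for wj in w] for ui in u]
A1 = [[aij if (i + j) % 2 == 0 else 0.0
       for j, aij in enumerate(row)] for i, row in enumerate(A)]
A2 = [[A[i][j] - A1[i][j] for j in range(3)] for i in range(4)]

v = [1.0, 1.0, 1.0]
for _ in range(30):
    # round 1: broadcast v; each server replies A_i v; coordinator sums -> A v
    Av = [x + y for x, y in zip(matvec(A1, v), matvec(A2, v))]
    # round 2: broadcast Av; each server replies A_i^T (Av); sum -> A^T A v
    z = [x + y for x, y in zip(rmatvec(A1, Av), rmatvec(A2, Av))]
    n = sum(x * x for x in z) ** 0.5
    v = [x / n for x in z]

print(v)  # aligns with w/|w| = (0.6, 0.0, 0.8), the top right singular vector
```

Per iteration the coordinator ships O(n + d) words total, with no server ever seeing another server's share of A.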
Characterization of a distinct lethal arteriopathy syndrome in twenty-two infants associated with an identical, novel mutation in FBLN4 gene, confirms fibulin-4 as a critical determinant of human vascular elastogenesis
Background: Vascular elasticity is crucial for maintaining hemodynamics. Molecular mechanisms involved in human elastogenesis are incompletely understood. We describe a syndrome of lethal arteriopathy associated with a novel, identical mutation in the fibulin 4 gene (FBLN4) in a unique cohort of infants from South India.
Methods: Clinical characteristics, cardiovascular findings, outcomes, and molecular genetics of twenty-two infants from a distinct population subgroup, presenting with characteristic arterial dilatation and tortuosity during the period August 2004 to June 2011, were studied.
Results: Patients (11 males, 11 females) presented at a median age of 1.5 months and belonged to unrelated families from an identical ethno-geographical background; eight had a history of consanguinity. Cardiovascular features included aneurysmal dilatation, elongation, tortuosity, and narrowing of the aorta, pulmonary artery, and their branches. The phenotype included a variable combination of cutis laxa (52%), long philtrum-thin vermillion (90%), micrognathia (43%), hypertelorism (57%), prominent eyes (43%), sagging cheeks (43%), long slender digits (48%), and visible arterial pulsations (38%). Genetic studies revealed an identical c.608A>C (p.Asp203Ala) mutation in exon 7 of the FBLN4 gene in all 22 patients, homozygous in 21, and compound heterozygous in one patient with a p.Arg227Cys mutation in the same conserved cbEGF sequence. Homozygosity was lethal (17/21 died; median age 4 months). Isthmic hypoplasia (n = 9) correlated with early death (≤4 months).
Conclusions: A lethal genetic disorder characterized by severe deformation of elastic arteries was linked to novel mutations in the FBLN4 gene. While describing a hitherto unreported syndrome in this population subgroup, this study emphasizes the critical role of fibulin-4 in human elastogenesis.
Algorithmic Geometry of Numbers
This article surveys Algorithmic Geometry of Numbers. The fundamental basis reduction algorithm of Lovász, which first appeared in Lenstra, Lenstra, Lovász [46], was used in Lenstra's algorithm for integer programming and has since been applied in myriad contexts, starting with the factorization of polynomials (A. K. Lenstra [45]). Classical Geometry of Numbers has a special feature in that it studies geometric properties of (convex) sets, such as volume and width, which come from the realm of continuous mathematics, in relation to lattices, which are discrete objects. This makes it ideal for applications to integer programming and other discrete optimization problems, which seem inherently harder than their "continuous" counterparts such as linear programming.
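As a concrete taste of basis reduction, the sketch below implements Lagrange-Gauss reduction, the two-dimensional case that LLL generalizes: it repeatedly subtracts an integer multiple of the shorter basis vector from the longer one, ending with a basis whose first vector is a shortest nonzero lattice vector.

```python
def gauss_reduce(b1, b2):
    """Lagrange-Gauss reduction: the 2-dim analogue of LLL basis reduction.
    Returns a basis of the same lattice with b1 a shortest lattice vector."""
    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]
    while True:
        if dot(b2, b2) < dot(b1, b1):
            b1, b2 = b2, b1                      # keep b1 the shorter vector
        m = round(dot(b1, b2) / dot(b1, b1))     # nearest-integer projection
        if m == 0:
            return b1, b2
        b2 = (b2[0] - m * b1[0], b2[1] - m * b1[1])

b1, b2 = gauss_reduce((1, 1), (3, 4))
print(b1, b2)  # a reduced basis of Z^2: both vectors have length 1
```

The swap/subtract structure is exactly what LLL performs pairwise, with a relaxed size condition, on bases in higher dimensions.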